Abstract

This paper presents a computational procedure for extracting demography data, 
mining patterns of human preferences, and measuring the topology of a virtual 
network. The network was created from the personal and relationships data of 
an online Internet-based community, where persons are considered nodes in the 
network, and relationships between persons are considered edges. A community 
of Friendster users whose listed hometown is Los Banos, Laguna was used as a 
test bed for the methodology. The method was able to provide the following 
demographic, preferential, and topological results about the test bed: 

1. There are more female users (52.34%) than male (47.66%); 

2. Ages 15-25 of both genders compose 68% of the users, with ages 26-40 
following at 28%, ages 41-85 at 4%, and senior citizens (64-85 years old) at
1%;

3. Homophily (i.e., birds-of-a-feather adage) is observed in the preferences of 
users with respect to age levels, such that they are strongly biased towards 
being friends with users of a similar age; 

4. There is heterophily in gender preference such that friendship among users 
of the opposite gender occurs more often;

5. The friendship network is well-connected and robust to node removal, such 
that users can still reach other users through another friend’s circle of friends, 
even if another user leaves the network; 

6. It exhibits a small-world characteristic with an average path length of 4.5 
(maximum=12) among connected users, shorter than the well-known six 
degrees of separation [10]; and

7. The network exhibits a scale-free characteristic with a heavy-tailed power-law
degree distribution (with exponent $\lambda = -1.02$ and $R^2 = 0.84$), suggesting the
presence of many users acting as network hubs.

The methodology was successful in providing important data from a virtual
community, which can be used by researchers in the fields of statistics,
mathematics, physics, social sciences, and computer science.


1. Introduction 

Since the beginning of the 21st century, the Internet has become pervasive in the
lives of connected people through web services called social networking sites.
Examples of social networking sites are Friendster^1, MySpace^2, and Facebook^3.
These web services allow users to publish personal information such as age, gender,
relationship status, and geographic location, and to mark other users as friends.
Thus, these social networking sites provide unprecedented opportunities to capture and
analyze the demography, user preferences, and topology of a community on a large
scale without resorting to the traditional procedure of surveying a population sample.

In recent years, several efforts have been made to capture and analyze the social
structure of virtual communities. Two recent works worth mentioning here are
those by Zinoviev and by Leskovec and Horvitz [1]. Zinoviev attempted to map
the topology and geometry of a Russian social network called Moi Krug^4 by utilizing
several graph-theoretic metrics such as node degree distribution and path lengths. In
his effort to define the community structure and the boundaries and inner areas of a
clique, he introduced the concepts of the dense core and the local density. He also used
these concepts to identify the socially popular and marginal users. Leskovec and
Horvitz [1], on the other hand, examined the characteristics and patterns
that emerge from the collective dynamics of a large number of people participating in
high-level communication activities using the Microsoft instant-messaging system. From
their data, they constructed a communication network with 180 million nodes and 1.3
billion undirected edges. After analyzing the network, they found that the average
degree of separation between nodes agrees with the six degrees of separation adage [1].

In this effort, a computerized methodology to extract the demography, user
preferences, and network topology of a virtual community was designed, implemented,
and tested, using as test bed the virtual community of Friendster users from a local
municipality. The methodology is centered on a customized web robot designed to crawl
and extract the demographic data and other statistics of a community of users. The
method includes creating an undirected network to mathematically map (and visually
graph) the relationship data, using nodes and edges to represent the users and their
relationships, respectively. From this network, several graph metrics were computed,
such as the average path length and degree distribution.

^1 http://www.friendster.com
^2 http://www.myspace.com
^3 http://www.facebook.com
^4 Moi Krug literally means My Circle

Presented in this paper is a methodology for extracting the demographic and preferential
patterns of users in a virtual community. Using the 7,172 Friendster users from Los
Banos, Laguna as test bed for the method, the resulting network was analyzed based on
centrality metrics. Section 2 briefly describes the test bed social networking site and the
kind of data that can be extracted from its database. Section 3 describes the web robot
employed to extract the demographic and preferential data of the selected virtual users,
while Section 4 discusses the results of the extraction, as well as the network analysis
performed. Lastly, Section 5 summarizes the findings of this study.


2. Friendster: The Test Bed Social Networking Site 

Friendster is a web-based social networking service founded in 2002 by Jonathan
Abrams [5]. Based on the Circle of Friends and Web of Friends techniques for
connecting humans in a virtual environment [7], the Friendster network demonstrates the
small world phenomenon [8]. Currently, the Friendster database has an estimated 70
million records corresponding to human users worldwide [2].

The Friendster system uses a method and a device to calculate, display, and perform 
actions on relationships on the social network. The method, termed the Web of Friends, 
combines the methodologies in the Circle of Friends and the Web of Contacts to collect 
descriptive information about their users. The method also allows the Friendster users to 
tag other users as their friends. Using the method, the descriptive information, as well as 
the friendship data, are integrated and processed to reveal a series of social relationships 
connecting any two users in the network. Thus, a specific user can determine the optimal
relationship path (i.e., a contact pathway) to reach desired individuals. Moreover, a 
communications tool was added to allow users to be introduced (or introduce themselves) 
and initiate direct communication with others [1]. 

3. Extracting Data from a Virtual Community 

3.1. The Web Robot 

A web robot was created using Perl scripts and a combination of the Linux command
line utility programs grep [6] and wget [5]. A Friendster account V was first created
because only currently logged-in users can view the profiles of other Friendster users.
Here, V is an actual account of a human Friendster user and not just a dummy
account. Friendster has already purged its database of Pretendsters, Fakesters, and
Fraudsters [9], so it is guaranteed that the accounts that the web robot crawls and
extracts data from are those of real, living humans. The web robot uses the cookie
file of the web browser used by the currently logged-in V. Thus, from the point of
view of the Friendster web server, the web robot is nothing but V himself, navigating
through the network of friends of Friendster users.


3.2. Los Banos Friendster Users 

Friendster's search tool was utilized to extract from the Friendster database the
accounts of those users whose listed hometown is Los Banos, Laguna. The search
parameters used were those that list all genders and the widest age range, with all
toggle parameters set to on. These toggle parameters are those that refer to friendship
preferences and relationship status (Figure 1). The Friendster web tool outputs an array
$L = \{l_0, l_1, \ldots, l_{p-1}\}$ of $p$ pages to list $N$ unique accounts. The pages
$l_0, l_1, \ldots, l_{p-2}$ each list 10 unique accounts, while the last page $l_{p-1}$
lists the remaining $N - 10(p-1)$ accounts. The web robot started crawling $l_0$ and
used its URL to crawl through the succeeding pages $l_i$, $\forall i = 1, \ldots, p-1$,
by changing only a parameter in the URL. For each page $l_i$, the web robot
automatically extracted the account number, user name, age, gender, and relationship
status of each user. This information was then stored in a database table $A_u$
such that, at the end of the crawl, $A_u$ had $N$ unique records corresponding to unique
Friendster accounts.
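
The paging arithmetic of the crawl can be sketched as follows. This is only an illustration in Python, not the Perl robot used in the study: the base URL and the query parameter name `page` are hypothetical placeholders, since the actual search URL is not given here, and the page size of 10 is taken from the description above.

```python
from urllib.parse import urlencode

PAGE_SIZE = 10  # accounts listed per result page, as described in the text


def page_sizes(n_accounts, page_size=PAGE_SIZE):
    """Return the number of accounts listed on each page l_0 .. l_{p-1}."""
    if n_accounts <= 0:
        return []
    full_pages, remainder = divmod(n_accounts, page_size)
    sizes = [page_size] * full_pages
    if remainder:
        sizes.append(remainder)  # the last page holds the leftover accounts
    return sizes


def page_urls(base_url, n_pages, page_param="page"):
    """Build one URL per result page by varying a single query parameter.

    `base_url` and `page_param` are hypothetical; the real search URL and
    its parameter name are assumptions made for illustration only.
    """
    return [f"{base_url}?{urlencode({page_param: i})}" for i in range(n_pages)]


sizes = page_sizes(7172)  # N = 7,172 users in the test bed
urls = page_urls("http://example.invalid/search", len(sizes))
```

For the 7,172 accounts of the test bed, this layout gives 718 pages, the last of which lists 2 accounts.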

During the web robot crawl, each user's list of friends was also crawled, and the
friends' respective data were extracted and stored in $A_u$. A user's friends list,
however, was stored in a separate database table $A_f$, taking note only of the user's
account number and the friends' respective account numbers. The tables $A_u$ and $A_f$
are in a one-to-many relationship, with the user's account number as the foreign key.


3.3. Extracting Demography Data 

The following demographic data were extracted from $A_u$:

1. Number $N_g$ and percentage $P_g$ of users by gender $\delta_g$;

2. Number $N_a$ and percentage $P_a$ of users by age group $\delta_a$;

3. Number $N_r$ and percentage $P_r$ of users by relationship status $\delta_r$;

4. Number $N_{g \times a}$ and percentage $P_{g \times a}$ of users by $\delta_g$ and $\delta_a$;

5. Number $N_{g \times r}$ and percentage $P_{g \times r}$ of users by $\delta_g$ and $\delta_r$;

6. Number $N_{a \times r}$ and percentage $P_{a \times r}$ of users by $\delta_a$ and $\delta_r$; and

7. Number $N_{g \times a \times r}$ and percentage $P_{g \times a \times r}$ of users by $\delta_g$, $\delta_a$, and $\delta_r$.

Here, it is easy to note that the percentage statistics can be derived using the frequency 
statistics via the following equations: 

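$$
\begin{aligned}
P_g &= \frac{N_g}{N} \times 100\%; \\
P_a &= \frac{N_a}{N} \times 100\%; \\
P_r &= \frac{N_r}{N} \times 100\%; \\
P_{g \times a} &= \frac{N_{g \times a}}{N} \times 100\%; \\
P_{g \times r} &= \frac{N_{g \times r}}{N} \times 100\%; \\
P_{a \times r} &= \frac{N_{a \times r}}{N} \times 100\%; \text{ and} \\
P_{g \times a \times r} &= \frac{N_{g \times a \times r}}{N} \times 100\%.
\end{aligned}
$$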

The basic frequency statistics are described as follows: 

1. $N_g$ is either $N_{\text{male}}$ or $N_{\text{female}}$. Note that $N_{\text{male}} + N_{\text{female}} = N$.

2. $N_a$ is one of $N_{8-25}$, $N_{26-40}$, $N_{41-64}$, or $N_{65-80}$. Note here that
$N_{8-25} + N_{26-40} + N_{41-64} + N_{65-80} = N$.

3. $N_r$ is any of $N_{\text{single}}$, $N_{\text{married}}$, $N_{\text{IAR}}$, or $N_{\text{unk}}$, where IAR and unk mean
in a relationship and unknown or undisclosed relationship status, respectively. Again,
note that $N_{\text{single}} + N_{\text{married}} + N_{\text{IAR}} + N_{\text{unk}} = N$.

The compound frequency statistics are simply the frequencies of users with a given
combination of attribute values. For example, the number of single, male users in the
age group 8-25 years old can be computed as
$N_{\text{male} \times 8-25 \times \text{single}} = |\delta_{\text{male}} \cap \delta_{8-25} \cap \delta_{\text{single}}|$.
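
The simple and compound frequency statistics, and the percentages derived from them, can be illustrated with a short Python sketch. The records below are hypothetical stand-ins for rows of the user table, and the age-group boundaries are an assumption based on the groups named in this section; this is not the code used in the study.

```python
from collections import Counter

# Hypothetical records standing in for rows of the user table.
users = [
    {"gender": "female", "age": 19, "status": "single"},
    {"gender": "male", "age": 22, "status": "single"},
    {"gender": "female", "age": 31, "status": "married"},
    {"gender": "male", "age": 45, "status": "unk"},
]

# Assumed age-group boundaries (inclusive), following the groups in the text.
AGE_GROUPS = [(8, 25), (26, 40), (41, 64), (65, 80)]


def age_group(age):
    """Map an age to its group label, or None if it falls outside all groups."""
    for lo, hi in AGE_GROUPS:
        if lo <= age <= hi:
            return f"{lo}-{hi}"
    return None


def frequencies(records, *keys):
    """Count records for each combination of the given attribute values."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def percentages(counts):
    """Convert counts to percentages of the total number of records."""
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}


rows = [dict(u, age_group=age_group(u["age"])) for u in users]
n_g = frequencies(rows, "gender")                          # N_g
p_g = percentages(n_g)                                     # P_g
n_gxaxr = frequencies(rows, "gender", "age_group", "status")  # N_{g x a x r}
```

Counting tuples of attribute values is equivalent to taking the size of the intersection of the corresponding user sets, so the same helper covers every simple and compound statistic listed above.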

3.4. Extracting Patterns of Preferences 

The following preferential patterns were extracted from the tables $A_u$ and $A_f$:

1. Preference with respect to gender; 

2. Preference with respect to age; and 

3. Preference with respect to relationship status. 

To analyze whether there are patterns in gender preferences, the frequencies of
male-male, female-female, and male-female relationships were extracted using an SQL
query on the inner join of $A_u$ and $A_f$. Similarly, the patterns of age group
preferences, as well as those of relationship status, were extracted using SQL queries
on the same inner join, but on different respective table attributes.
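
A query of this kind might look like the sketch below, written in Python with the standard `sqlite3` module. The table names `users` and `friends`, their columns, and the sample rows are assumptions made for illustration; the actual schema and database engine used in the study are not specified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    account INTEGER PRIMARY KEY,
    gender  TEXT,
    age     INTEGER,
    status  TEXT
);
CREATE TABLE friends (
    account INTEGER REFERENCES users(account),
    friend  INTEGER REFERENCES users(account)
);
""")

# Hypothetical sample data; each friendship is stored once.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [(1, "male", 20, "single"), (2, "female", 21, "single"), (3, "female", 30, "married")],
)
conn.executemany("INSERT INTO friends VALUES (?, ?)", [(1, 2), (1, 3), (2, 3)])

# Join the friends table to the user table twice, once per endpoint,
# then count friendships by the gender pairing of the two endpoints.
query = """
SELECT CASE WHEN u.gender = v.gender
            THEN u.gender || '-' || v.gender
            ELSE 'male-female'
       END AS pair,
       COUNT(*) AS n
FROM friends AS f
JOIN users AS u ON u.account = f.account
JOIN users AS v ON v.account = f.friend
GROUP BY pair
ORDER BY pair
"""
gender_pairs = dict(conn.execute(query).fetchall())
```

Replacing the `CASE` expression with, for example, `ABS(u.age - v.age)` or a pairing of `u.status` and `v.status` would give the corresponding age and relationship status patterns under the same assumed schema.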


3.5. Creating and Analyzing the Friendship Network 

The friendship network was created using the data in $A_f$. Taking each user as a node
and the relationships between users as edges, an $N \times N$ adjacency matrix $R$ was
created. The element $r_{ij} \in R$ takes the value 1 if a relationship record between
users $i$ and $j$ exists in $A_f$; otherwise, $r_{ij} = 0$.

From R, the following network metrics were computed: 

1. The minimum, average, and maximum number of edges connected to any arbitrary 
node; 

2. The minimum, average and maximum path lengths between two arbitrary nodes; 
and 

3. The degree distribution of the number of edges an arbitrary node has. 
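
The three metrics can be illustrated on a toy network with the following Python sketch. It assumes an undirected, unweighted network and uses breadth-first search for path lengths; the paper does not state which algorithm was actually used, and the edge list here is invented. Unreachable pairs are skipped, so the path statistics cover connected users only.

```python
from collections import deque


def build_adjacency(n, edges):
    """Build a symmetric N x N adjacency matrix from (i, j) edge pairs."""
    r = [[0] * n for _ in range(n)]
    for i, j in edges:
        r[i][j] = r[j][i] = 1
    return r


def degrees(r):
    """Number of edges attached to each node (row sums of the matrix)."""
    return [sum(row) for row in r]


def path_lengths(r):
    """Shortest path length for every connected pair (i < j), via BFS."""
    n = len(r)
    lengths = []
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if r[u][v] and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Keep each unordered pair once; unreachable nodes never enter dist.
        lengths.extend(d for target, d in dist.items() if target > source)
    return lengths


edges = [(0, 1), (1, 2), (2, 3), (1, 3)]  # hypothetical toy network
R = build_adjacency(5, edges)             # node 4 is isolated
deg = degrees(R)
paths = path_lengths(R)
avg_path = sum(paths) / len(paths)
```

Scanning a dense matrix row for every visited node costs on the order of $N^2$ per search, so for a network of several thousand users an adjacency-list representation would likely be preferable.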

4. Results and Discussion 

4.1. Demography 

Figure 2 shows the number (and percentage) of users by (a) gender, (b) relationship
status, and (c) relationship status and gender. In Figure 2a, female users outnumber
male users by 4.68% (336 users). In Figure 2b, single users dominate the network at
56.07%. Users who are IAR, married, or with undisclosed relationship status compose
14.26%, 13.98%, and 15.69%, respectively. In Figure 2c, female users who are IAR,
married, single, and even those with undisclosed relationship status outnumber males
by 0.38%, 1.13%, 3.05%, and 0.13%, respectively.

Figure 3 shows the number of users (a) by age and (b) by age and relationship status.
The trend shows that the dominant users are in their late teens and mid-twenties, as
shown by the two equally high peaks. This pattern is consistent with the pattern of the
single users. Among the married users, the late 20s and early 30s are the dominant ages.
Among the users who are IAR, the dominant age is the early 20s. Figure 4 shows the
number of users (a) by age and gender, and (b-c) by age, gender, and relationship
status. The trend among female users generally follows that of the singles, but there
are more users in their late teens than in their early 20s. The trend among male users
is also somewhat similar to that of the singles, but there are more users in their
early 20s than in their late teens.


4.2. Patterns of Friendship Preferences 

The pattern of friendship preferences with respect to gender was extracted from the 
network created. The frequency of friendships between two male users, between two 
female users, and between a male and a female user were counted. Figure 5 summarizes
the results of the frequency analysis. The figure shows that friendships with the
opposite gender occur more often than with the same gender. Thus, it can be deduced
that users prefer to be friends with users of the opposite gender, suggesting
heterophily in gender preferences.

Figure 6 shows the frequency of relationships between users whose age difference ranges
from 0 to 75 years. The pattern shows that relationships between users whose age
difference is at most 10 years occur more often than those whose age difference is
greater than 10 years. Thus, it can be deduced from the pattern that users choose to be
friends with users of a similar age, suggesting homophily in age level preference.

Figure 7 shows the frequency of relationships between users with respect to relationship
status. The pattern shows that half of the singles prefer to be friends with singles
(i.e., homophily) while half prefer to be friends with people who are already IAR (i.e.,
heterophily); IAR users prefer to be friends with those who are already IAR.

4.3. Centrality and Topology of the Social Network 

The results of path analysis show that the network has a maximum and an average
path length of 12 and 4.5, respectively. This suggests that one user of the network
can reach another user through a friend of a friend via an average of 4.5 persons, and
that the person is guaranteed to be reached via a maximum of 12 persons. Figure 8
shows the distribution of the number of friends (number of friends versus frequency) on
a log-log scale. The distribution follows a power law; thus, the network is
considered scale-free. The presence of a heavy tail in the distribution suggests that
many users are acting as network hubs. This implies that information sources should
target these hubs because they have a wide sphere of influence, as they are considered
opinion leaders. Information (or gossip, or an epidemic) spreads faster into the network
if it originates from these hubs.
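
One common way to obtain a power-law line on a log-log plot is an ordinary least-squares fit of log frequency against log number of friends, with $R^2$ as the goodness-of-fit value. The Python sketch below shows that approach on synthetic data; it is an assumption that the line in this study was fitted this way, and the sample degree sequence is invented.

```python
import math
from collections import Counter


def fit_power_law(degree_sequence):
    """Least-squares line through (log10 k, log10 frequency).

    Returns (slope, intercept, r_squared). Nodes with zero friends are
    skipped because log10(0) is undefined.
    """
    counts = Counter(k for k in degree_sequence if k > 0)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Synthetic degree sequence in which frequency is exactly 1000 / k.
sample = [k for k in (1, 2, 4, 5, 10) for _ in range(1000 // k)]
slope, intercept, r_squared = fit_power_law(sample)
```

On this synthetic sequence the fitted slope is -1 up to floating-point error, since the frequencies were constructed to fall exactly on a line of that slope in log-log space. Least-squares fitting on binned log-log data is simple but can be biased for real degree distributions, so the sketch should be read as an illustration of the idea rather than a recommended estimator.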

5. Summary and Conclusion

This paper presents a computational methodology for extracting demography data, 
mining patterns of human preferences, and measuring the topology of a virtual network. 
Using the Friendster network of Los Banos residents as a test-bed for the methodology, 
the following facts were extracted: 

1. There are more female community users (52.34%) than male (47.66%); 

2. Ages 15-25 of both genders compose 68% of the users, with ages 26-40 following
at 28%, ages 41-85 at 4%, and senior citizens (64-85 years old) at 1%;

3. Homophily (i.e., birds-of-a-feather adage) is observed in the preferences of users 
with respect to age levels, such that they are strongly biased towards being friends 
with users of a similar age; 

4. There is heterophily in gender preference such that friendship among users of the 
opposite gender occurs more often;

5. The friendship network is well-connected and robust to node removal, such that 
users can still reach other users through another friend’s circle of friends, even if 
another user leaves the network; 

6. It exhibits a small-world characteristic with an average path length of 4.5
(maximum=12) among connected users, shorter than the well-known six degrees of
separation [10]; and

7. The network exhibits a scale-free characteristic with a heavy-tailed power-law
degree distribution (with exponent $\lambda = -1.02$ and $R^2 = 0.84$), suggesting the
presence of many users acting as network hubs.

The methodology was successful in providing important data from the test-bed virtual 
community. This method can be used by several researchers in the fields of statistics, 
mathematics, physics, social sciences, and computer science. 

Figure 1: A snapshot of the Friendster search tool showing all the available search
parameters.

Figure 2: (Colored on digital copy) Number of users and percentage (a) by gender, (b) by
relationship status, and (c) by gender and relationship status.

Figure 3: (Colored on digital copy) Number of users (a) by age, and (b) by age and
relationship status.

Figure 4: (Colored on digital copy) Number of (a) female and male users by age,
(b) female users by age and relationship status, and (c) male users by age
and relationship status.

Figure 5: Frequency of male-male, male-female, and female-female relationships (where
M=male and F=female).

Figure 6: Frequency of relationships between users with respect to age differences.

Figure 7: Frequency of relationships with respect to relationship status (where S=single,
M=married, IAR=in a relationship, and Unk=unknown).

Figure 8: Log-log plot of the number of friends versus frequency, which follows a power
law distribution. The straight line is the power-law fit, whose slope was found to be
$\lambda = -1.02$ with $R^2 = 0.84$.